Abstract

Error Tree is a novel tree structure oriented mainly toward the approximate pattern matching
problems (Hamming and edit distances) as well as the wildcards matching problem. The input is
a text of length $n$ over a fixed alphabet of size $\Sigma$, a pattern $P$ of length $m$, and $k$.
The output is all positions that have $\le k$ Hamming distance, edit distance, or wildcards
matching with $P$. For Hamming distance and wildcards matching, the algorithm proposes a tree
structure that needs $O(n \frac{\log_\Sigma^k n}{k!})$ words and takes $O(\frac{m^k}{k!} + occ)$
query time ($O(m + \frac{\log_\Sigma^k n}{k!} + occ)$ in the average case) for any online/offline
pattern, where $occ$ is the number of outputs. For edit distance, it proposes a tree structure of
$O(2^k n \frac{\log_\Sigma^k n}{k!})$ words with $O(\frac{m^k}{k!} + 3^k occ)$ query time
($O(m + \frac{\log_\Sigma^k n}{k!} + 3^k occ)$ in the average case) for any online/offline pattern.


1 Introduction 

With the continuing growth of internet-based search, information retrieval, data mining
applications, and bioinformatics research, there is an increasing need to determine whether a
given pattern occurs as an exact match, an approximate match, or a wildcards match in a given
database. The pattern is usually small, such as a word or a sentence, while the database is
much larger, such as web documents, genomes, or books.

The exact matching problem is the simplest form of the pattern matching problems, while the 
approximate and wildcards matching are more complicated. For the exact matching problem, 
[15] proposed a tree structure which was improved by [12] and [M] and led to an optimal solution 
of linear tree structure and linear query time. 

Approximate matching involves two problems: Hamming distance and edit distance (also known
as Levenshtein distance). The Hamming distance between two strings is the minimal number of
substitution operations required to transform the first string into the second, whereas the
edit distance is the minimal number of substitution, insertion, or deletion operations required
to do so. The wildcards matching problem, on the other hand, arises when the pattern contains
a wildcard (also known as a don't-care character), represented by a special symbol, that matches
any character of the alphabet. For these matching problems there is as yet no solution that is
linear in both structure size and query time.
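
The three notions of matching defined above can be sketched directly on strings. This is only an illustration of the definitions, not part of the Error Tree design; the function names and the choice of "?" as the wildcard symbol are assumptions made for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def wildcard_match(pattern: str, s: str, wildcard: str = "?") -> bool:
    """True if s matches pattern, where `wildcard` (assumed symbol) matches anything."""
    return len(pattern) == len(s) and all(
        p == wildcard or p == c for p, c in zip(pattern, s))
```

Both distance functions compare two complete strings; the problems stated below ask for all database strings or text positions within a given threshold of the pattern.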

In this paper we are considering the following problems with the following inputs and outputs: 

Problem 1: Dictionary Matching 

Inputs: A database of $N$ strings, each string $s$ of length $m$ over a finite alphabet of size
$\Sigma$, with total database size $n$; a pattern $P = p_1 p_2 \dots p_m$; and $K$.

Outputs: All strings in the database that have with $P$ $\le K$ Hamming distance (K-Hamming
distance), $\le K$ edit distance (K-edit distance), or $\le K$ wildcards matching (K-wildcards
matching).

Problem 2: Text Matching 

Inputs: A text of $n$ symbols over a finite alphabet of size $\Sigma$, a pattern
$P = p_1 p_2 \dots p_m$, and $K$.

Outputs: All positions of the substrings of the text that have K-Hamming distance,
K-edit distance, or K-wildcards matching with $P$.

Note that problem 1 is a special case of problem 2. This paper first describes the data
structure for problem 1, as it is easier to describe than that of problem 2, and then describes
the design for problem 2.

The paper introduces an algorithm over a novel tree structure that leads to efficient bounds
for the problems above.

Outline: Section 2 surveys the background and related work of the problems. Then, section 
3 states the preliminaries. Next, section 4 shows the design of the error tree structure for the 
first problem. After that, the modified design for problem 2 will be shown in section 5. Finally, 
section 6 states the conclusions. 


2 Background and Related Work 


For the problem of text indexing, the naive algorithm takes $O(nm)$ time. A faster algorithm,
the Kangaroo method, was proposed by m; it builds a least common ancestor tree [7], which allows
each alignment to be found in $O(k)$ time, hence the time cost is $O(kn)$. A better algorithm
was proposed by [1] that computes all alignments in $O(n\sqrt{m \log m})$. The most recent
algorithm, proposed by [2], takes $O(n\sqrt{k \log k})$. Note that all of the bounds above are
in the order of $n$.
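
For reference, the naive $O(nm)$ scan mentioned above can be sketched as follows. The function name is hypothetical, positions are 0-based, and the early exit is a small practical addition that does not change the worst-case bound.

```python
def naive_k_mismatch(text: str, pattern: str, k: int) -> list[int]:
    """Return 0-based start positions whose window is within k mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[start + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:   # early exit once the budget is spent
                    break
        else:
            # The inner loop finished without exceeding the budget.
            hits.append(start)
    return hits
```

Every alignment is inspected symbol by symbol, which is exactly why the cost depends on $n$ regardless of how few matches exist.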

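Recently, [5] proposed a data structure that takes $O(n \frac{(c_1 \log n)^k}{k!})$ words and
finds the k-Hamming distance in $O(m + \frac{(c_2 \log n)^k}{k!} \log\log n + occ)$ time, as well
as a structure of $O(n \frac{(c_3 \log n)^k}{k!})$ words which finds the k-edit distance in
$O(m + \frac{(c_4 \log n)^k}{k!} \log\log n + 3^k occ)$, where $c_1, c_2, c_3, c_4 > 1$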

are constants. The data structure of [5] can output the k-edit distance in -I- occ). 

The construction of their index structure needs 0{nlog^n) words of space in the average case, 
and takes 0{KN'P) of time where N is the number of nodes in the index. An upper bound 
algorithm was proposed by [13], where the space complexity of the index structure is
$O(n^{1+\epsilon})$ for any constant $\epsilon > 0$, but with a query time of
$O(m + \log\log n + occ)$ for both k-Hamming distance and k-edit distance.

Many algorithms solved both distances using a lower structure space but using an upper 
query time. Among these algorithms, |4] presented a linear space index 0(n), but with 0{m + 
nloglogn + occ) of query time. In [9], a data structure of 0{nJlognlogYi) bits was 
proposed, that takes 0{Yi^m^{k loglogn) + occ) of query time; while using 0{n) bits of space, 
the query time will be 0{log^n{Yj^m^{k loglogn) + occ)), where 0 < e < 1. Using more space, 
| 8 ] showed an index structure that needs 0{nlogn) bits and takes {Ti^m^max{k, logn) + occ) of 
query time. They also reduced the index space to 0{n) by increasing the query time by factor 
of 0(logn). 

For the k-wildcards matching problem, there is a light reduction in the time and space cost 
over the distances problems. The design of error tree structure in this paper allows solving 
the k-wildcards matching with the same bound of k-Hamming distance. Many algorithms pro¬ 
posed structures for solving the problem such as and m- m proposed a structure of 

Q(^ (fc+iog») .y^fords, that solves the problem in 0{m ‘^ loglogn While |3] generalized the 

structure of | 6 ] and reduced the space to 0{nlognlogp~^n) words, but the query time increased 
to 0(m -I- P^loglogn + occ), where 2 < /3 < E. Less space structure was presented in [IT] , 
where the needed space is 0{nlog^nlogTi) bits, but with slight increase in the query time to 
0{m -I- 2^logn + occ). 

Error Tree is a novel tree structure that is mainly oriented to solve the aforementioned
problems. First, it constructs a tree structure that costs $O(n \frac{\log_\Sigma^k n}{k!})$
words for Hamming distance and wildcards matching and takes $O(\frac{m^k}{k!} + occ)$ query time
($O(m + \frac{\log_\Sigma^k n}{k!} + occ)$ in the average case) for any online/offline pattern.
Second, it constructs a tree structure of $O(2^k n \frac{\log_\Sigma^k n}{k!})$ words with
$O(\frac{m^k}{k!} + 3^k occ)$ query time ($O(m + \frac{\log_\Sigma^k n}{k!} + 3^k occ)$ in the
average case) for edit distance of any online/offline pattern.


3 Preliminaries 

For a string $s$, $len(s)$ is the length of string $s$, $s[x : y]$ is the substring from position
$x$ to position $y$, and $suff(s, i)$ is the $i$th suffix of string $s$. For a list $l$, $l[i]$ is
the item at index $i$, $l[-1]$ is the item at the last index, and similarly $l[-2]$ is the item
before the last item in $l$.
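
The notation can be mirrored in a few lines of code. This sketch assumes that positions are counted from 1 and that s[x : y] includes both endpoints; the text does not state either convention explicitly, and the helper names are chosen for the example.

```python
def suff(s: str, i: int) -> str:
    """i-th suffix of s, with positions counted from 1 (assumed convention)."""
    return s[i - 1:]


def sub(s: str, x: int, y: int) -> str:
    """s[x : y] read as positions x..y inclusive, 1-based (assumed convention)."""
    return s[x - 1:y]


# The list notation l[-1] and l[-2] coincides with Python's negative indexing.
items = [10, 20, 30]
last, before_last = items[-1], items[-2]
```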

4 Dictionary Matching Algorithm 

In order to explain the steps of the algorithm, the paper first shows the steps for k = 1,
and then explains the general design for k ≥ 2.

4.1 The Case k = 1

The algorithm involves two stages; construction of a tree, then searching for the strings that have 
Hamming distance with P equal to 1, a.k.a strings with 1-Hamming distance or 1-mismatch. 

4.1.1 Construction stage 

1. First, a generalized suffix tree, m, needs to be built for the strings in the database. By
the definition of the suffix tree, there will be a leaf node for each suffix of the strings in
the database. This costs $O(n)$ words.

2. All leaves and internal nodes in the suffix tree are assigned a unique key. This costs
$O(n)$ words.

Definition 1: A suffix tree in which a unique key identifies each leaf node and each
internal node is called a keyed suffix tree (KST).

Definition 2: For a string s and a given KST, the function All Visited Nodes, denoted
AVN(s), returns the list of the nodes' keys and the edges' lengths, in order, that results
from walking s in the KST.

Corollary 1: For the case k = 1, and since in problem 1 we already have leaves for all
suffixes of the strings in the database, we can find AVN(s)[-1] for any suffix s of any string
in the database in constant time. This is because AVN(s)[-1] is the key of the last visited
node after traversing the suffix s in the KST, which must be a leaf node. It can be achieved by
hashing all the leaves and returning AVN(s)[-1] in constant time without walking s in the KST.

Corollary 2: Given a KST and two strings $s_1$ and $s_2$, where $len(s_1) = len(s_2)$ and $s_1$
is in the KST: if the Hamming distance between $s_1$ and $s_2$ is 1 and the mismatch occurs at
position x, where x is not the last position, then AVN(suff($s_1$, x+1)) = AVN(suff($s_2$, x+1)).
Similarly, AVN(suff($s_1$, x+1))[-1] = AVN(suff($s_2$, x+1))[-1].
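
Corollaries 1 and 2 can be illustrated without building a suffix tree. In the sketch below a plain dictionary from suffix to integer key stands in for the hashed leaves of the KST; this substitution, and the function names, are assumptions made for the example.

```python
def leaf_keys(database: list[str]) -> dict[str, int]:
    """Give every distinct suffix of every string a unique integer key.

    The dictionary stands in for hashing the leaves of the keyed suffix
    tree, so that AVN(suffix)[-1] becomes a constant-time lookup.
    """
    keys: dict[str, int] = {}
    for s in database:
        for start in range(len(s)):
            keys.setdefault(s[start:], len(keys))
    return keys


def last_key_after_mismatch(keys: dict[str, int], s: str, x: int):
    """Key of suff(s, x+1) for a 1-based mismatch position x, or None."""
    return keys.get(s[x:])
```

For two strings that differ only at position x, the suffixes starting at x+1 are identical, so both lookups return the same key, which is the property that the construction stage relies on.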

3. Construct a compact trie of all the strings in the database, in $O(n)$ time and space.

4. For each internal node v, a hash table $I_1$ is initialized. Then, for all leaves L of the
subtree rooted at v, and assuming v is at level (symbol depth) i, for each leaf l in L we first
pick any string s labeled at l and then add to $I_1$ the tuple (AVN(suff(s, i+1))[-1], l). So:

For each internal node v in the tree:
    initialize a hash table v.I_1
    // get the level (symbol depth) of the node
    i = get_level(v)

    // get the descendant leaves of the node
    L = get_desc_leaves(v)

    For l in L:
        pick a string s labeled at l
        v.I_1.add((AVN(suff(s, i+1))[-1], l))


Definition 3: We call such a tree structure, without loss of generality, a 1-error tree, as it
was constructed to find 1 mismatch.

Eventually, step 4 is performed for $O(N)$ internal nodes, and the work at each node is
bounded by its number of descendant leaves. If the 1-error tree is unbalanced, step 4 is not
performed on the leaves under a branch that lies on a heavy path (the path with the most
descendant leaves). Namely, each node has $\Sigma$ branches; all descendant leaves of the branch
on a heavy path are excluded in step 4 and are treated in the query stage as an edge. This means
that a balanced trie is the worst-case scenario, so the bound is $O(N \log_\Sigma N)$ words of
space.

4.1.2 Query stage

First, given the pattern P, all of its suffixes need to be added to the KST, and AVN(.)[-1] is
computed for each suffix. The result is the following list R: {AVN(suff(P, 1))[-1],
AVN(suff(P, 2))[-1], ..., AVN(suff(P, m))[-1]}.

Second, we walk P in the 1-error tree as follows. If the walk is on an edge and the next
symbol in P matches the next symbol on the edge, then continue as an exact match. If the next
symbol in P does not match the next symbol at level j, we have reached 1 mismatch; we can jump
over the next symbol (since the walk is on an edge) and continue as an exact match until we
reach a leaf, if any, and output the strings labeled at that leaf as having 1-Hamming distance
at position j.

Now, if the walk reaches a node v at level j, then look up whether the key R[j+1] is in the
$I_1$ table of v (a constant-time operation, as $I_1$ is a hash table). If so, all strings
labeled at the leaves that were associated with key R[j+1] have 1-Hamming distance at position
j; then continue the walk as an exact match, still searching for a mismatch further down. If
the next symbol in P does not match any child of v, then stop searching.
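
The construction and query stages for k = 1 can be sketched in a simplified form. The sketch below is not the structure described above: it uses an uncompressed table of prefixes instead of a compact trie, uses the suffix string itself as the key in place of AVN(.)[-1], and omits the heavy-path exclusion, so its space usage is larger than the stated bound. All names are hypothetical.

```python
from collections import defaultdict


def build_one_error_index(strings):
    """Map each trie node (a prefix) to a table: skipped-suffix -> strings."""
    tables = defaultdict(lambda: defaultdict(list))
    for s in strings:
        for i in range(len(s)):
            # The node at depth i is the prefix s[:i]; the symbol s[i] is the
            # one allowed to mismatch, so the key is the suffix after it.
            tables[s[:i]][s[i + 1:]].append(s)
    return tables


def query_one_mismatch(tables, pattern):
    """Return (string, 1-based mismatch position) pairs at distance exactly 1."""
    hits = []
    for i in range(len(pattern)):
        node = tables.get(pattern[:i])
        if node is None:              # the pattern prefix left the trie: stop
            break
        for s in node.get(pattern[i + 1:], []):
            if s[i] != pattern[i]:    # skip the exact match itself
                hits.append((s, i + 1))
    return hits
```

Each query performs one table lookup per pattern position, which mirrors the role of the R list: the key for position i depends only on the suffix of the pattern after that position.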

4.1.3 Extension for indels 

The design can be extended to handle insertions and deletions, which means that we can output
all strings that have edit distance ≤ K = 1 with P, instead of only the Hamming distance.

First of all, because insertions and deletions shift the suffixes, such shifts must be tracked
and handled by the design of the algorithm, mainly by the AVN function.

If two strings $s_1$ and $s_2$ have edit distance 1 caused by a deletion at position x of $s_2$,
this means that suff($s_1$, x)[1 : m-x-1] = suff($s_2$, x+1). Since the AVN function starts at
the root node and must end at a node that has a unique key, the design should guarantee this.
For suff($s_2$, x+1), the walk must end on a node, and this causes no conflict in computing AVN.
On the other hand, suff($s_1$, x)[1 : m-x-1], which is k = 1 level up from the suffix
suff($s_1$, x), may cause a conflict because it may not be a leaf node. Therefore, this position
must be guaranteed to be a leaf node with a unique key, and the following preprocessing step
must be performed.

1. For each internal node v in the 1-error tree, assume v is at level i and consider each leaf
l among the leaves L of the subtree rooted at v. First pick any string s labeled at l, then
walk up by k = 1 level of the parent node of the leaf node of suff(s, i). If we reach a node,
say x, check whether x has a leaf node as a child; if not, create a new leaf node with a unique
key. If there is no node, create a new node with a unique key and then create a leaf node with
a unique key as a child of this new node. The cost is $O(N \log_\Sigma N)$ space and time. This
helps to track the effects of shifting the suffixes caused by deletions and insertions.

Insertions and deletions can occur in the pattern or in the strings. Before explaining the 4
cases, we need to introduce the following corollaries, given that step 1 has already been
performed:

Corollary 3: Given a KST and two strings $s_1$ and $s_2$, where $len(s_1) = len(s_2) = m$ and
$s_1$ is in the KST: if the edit distance between $s_1$ and $s_2$ is 1 and the edit operation is
a deletion at position x in $s_2$, where x is not the last position, then:

AVN(suff($s_1$, x))[1 : m-x-1] = AVN(suff($s_2$, x+1)). Similarly, AVN(suff($s_1$, x))[-2] =
AVN(suff($s_2$, x+1))[-1]. Note that suff($s_1$, x)[1 : m-x-1] will always end at a leaf node
(which must have a unique key) because of step 1.

Corollary 4: Given a KST and two strings $s_1$ and $s_2$, where $len(s_1) = len(s_2) = m$ and
$s_1$ is in the KST: if the edit distance between $s_1$ and $s_2$ is 1 and the edit operation is
an insertion at position x in $s_2$, where x is not the last position, then:

AVN(suff($s_1$, x+1)) = AVN(suff($s_2$, x)[1 : m-x-1]). Note also that suff($s_2$, x)[1 : m-x-1]
will always end at a leaf node (which must have a unique key) because of step 1, so similarly
AVN(suff($s_1$, x+1))[-1] = AVN(suff($s_2$, x))[-2].

2. For the edit distance, the following 4 cases are possible:

Case 1: Deletion in the strings. Note that this case can be handled using the $I_1$ table.
Based on corollary 3, we can check whether AVN(suff(P, i))[-2] is in $I_1$ or not. If so, all
strings labeled at the leaves that were associated with the key AVN(suff(P, i))[-2] must have
edit distance with P through a deletion in them at position i.

Case 2: Insertion in the strings. For this case, another hash table $I_{1ins}$ needs to be
initialized. Then step 4 of section 4.1.1 is performed, but instead of adding
(AVN(suff(s, i+1))[-1], l) into $I_1$, (AVN(suff(s, i))[-2], l) is added to $I_{1ins}$. Note that
because of step 1, AVN(suff(s, i))[-2] will always be a leaf node with a unique key. This allows
us to check whether AVN(suff(P, i+1))[-1] is in $I_{1ins}$ or not, based on corollary 4. If so,
all strings labeled at the leaves that were associated with the key AVN(suff(P, i))[-2] must
have edit distance with P through an insertion in them at position i.

Before proceeding to the next two cases, note that a deletion in the strings is similar to
an insertion in the pattern. Likewise, an insertion in the strings is similar to a deletion in
the pattern.

Case 3: Deletion in the pattern. There is no need to modify the construction of the 1-error
tree; this case can be computed by searching for AVN(suff(P, i+1))[-1] in $I_{1ins}$.

Case 4: Insertion in the pattern. This case can be computed by searching for
AVN(suff(P, i))[-2] in $I_1$.

In conclusion, at each internal node we will have three tables corresponding to the operations
of mismatch and insertion. The cost for step 1 is $O(N \log_\Sigma N)$ words of space and time.
The cost for step 2, which is computing $I_{1ins}$, is the same as computing $I_1$ in step 4 of
section 4.1.1, namely $O(N \log_\Sigma N)$ words of space.

4.2 The Case K ≥ 2

In the case K = 1, the main step of the design is to associate the key of the last node (a
leaf), AVN(.)[-1], of the suffixes with the leaves' labels. In the case K ≥ 2, we associate the
keys of all the nodes returned by AVN(.) for a suffix s with all tuples that contain s in the
$I_{k-1}$ tables on the nodes along the path of s in the (k-1)-error tree. Before describing the
steps of the design, we state the following corollary:

Corollary 5: Given a KST and two strings $s_1$ and $s_2$, where $len(s_1) = len(s_2)$ and $s_1$
is in the KST: if the Hamming distance between $s_1$ and $s_2$ is k, the mismatches occur at
positions $pos = \{p_1, p_2, \dots, p_{k-1}, p_k\}$, and at the level of each position in $pos$ on
the path to $s_1$ there is a node, then:

$AVN(s_1[1 : p_1 - 1]) = AVN(s_2[1 : p_1 - 1])$,
$AVN(s_1[p_1 + 1 : p_2 - 1]) = AVN(s_2[p_1 + 1 : p_2 - 1])$,
$\dots$
$AVN(s_1[p_{k-1} + 1 : p_k - 1]) = AVN(s_2[p_{k-1} + 1 : p_k - 1])$

Equivalently,

$AVN(s_1[1 : p_1 - 1])[-1] = AVN(s_2[1 : p_1 - 1])[-1]$,
$AVN(s_1[p_1 + 1 : p_2 - 1])[-1] = AVN(s_2[p_1 + 1 : p_2 - 1])[-1]$,
$\dots$
$AVN(s_1[p_{k-1} + 1 : p_k - 1])[-1] = AVN(s_2[p_{k-1} + 1 : p_k - 1])[-1]$
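
The idea behind Corollary 5 is that the pieces of the two strings lying between consecutive mismatch positions are identical, so any walk over them visits the same nodes. A minimal sketch on plain strings follows; the function names are hypothetical, positions are 1-based, and the sketch also returns the piece after the last mismatch, which the corollary does not list.

```python
def mismatch_positions(s1: str, s2: str) -> list[int]:
    """1-based positions where two equal-length strings differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(s1, s2)) if a != b]


def segments_between(s: str, positions: list[int]) -> list[str]:
    """Substrings of s strictly between consecutive 1-based positions,
    including the piece before the first and after the last position."""
    bounds = [0] + positions + [len(s) + 1]
    return [s[lo:hi - 1] for lo, hi in zip(bounds, bounds[1:])]
```

For example, "abcdefgh" and "abXdeYgh" differ at positions 3 and 6, and both split into the same segments "ab", "de", and "gh".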


4.2.1 Construction stage 

1. The first step is to collect the nodes' keys in the KST. We do the following for each
internal node v:

1.1 At node v, we know the descendant leaves L under that node and its level, say i. We
first initialize a hash table $I_k$; then, for each leaf l in L, we pick any string s and
compute AVN(suff(s, i+1)). Note that in the computation of AVN(), if the walk is on an edge,
AVN() returns the length of that edge and a tag indicating that we are on an edge; if the walk
is at a node, it returns the key of that node and a tag indicating that we are at a node. Note
also that the list will have $O(\log_\Sigma n)$ items, since the balanced trie is the worst-case
scenario for the design.

1.2 After computing AVN(suff(s, i+1)), first walk in the k-error tree to the leaf l,
skipping 1 level, and do the following:


if next node u in AVN(suff(s, i + 1)) is aligned in the middle of an edge:
    v.I_k.add(((u.key(), edge), l))

if next node u1 in AVN(suff(s, i + 1)) is aligned with a node u2:
    for each tuple p in u2.I_{k-1} that has l:
        v.I_k.add((u1.key(), p[1], ..., p[k]))


Note that we do not walk explicitly to leaf l. We check the alignment between a visited node
on the path and a node in AVN(suff(s, i+1)), or between the length of an edge visited on the
path and an edge's length in AVN(suff(s, i+1)), which is a simple convolution. So we have the
following cases while walking to leaf l in the k-error tree:

Case 1: The next node u in AVN(suff(s, i+1)) is aligned in the middle of an edge in the
k-error tree. Then add to $I_k$ of v a tuple containing the key of u, a tag indicating that the
alignment was on an edge, and l.

Case 2: The next node u1 in AVN(suff(s, i+1)) is aligned with a node u2 while walking in the
k-error tree. Then, for each tuple p that has l in $I_{k-1}$ of u2, we associate, as a tuple,
the key of u1 and the items in p, in their order, and add the tuple to $I_k$ of v.

At the end of the walk, note that each of the $O(\log_\Sigma n)$ node keys may be associated
(multiplied) with the tuples that are in $I_{k-1}$ during the walk to leaf l, which determines
the cost in words of space at each level. As there are $O(\log_\Sigma N)$ levels, the bound is
the sum of this cost over all levels.

1.3 Steps 1.1 and 1.2 were performed for suff(s, i+1) on the $I_{k-1}$ tables while skipping
1 level. Similarly, we perform the same steps for suff(s, i+2), ..., suff(s, i+k-1) on the
tables $I_{k-2}, \dots, I_1$ while skipping 2, ..., k-1 levels, respectively.

2. So far, we have covered the case where, at each internal node, the symbols at the first
k-1 levels are errors. This does not cover the case where all of the first k symbols after the
internal node are errors. For this case, we need to perform step 4 of section 4.1.1 for the
suffixes suff(s, i+k) instead of suff(s, i+1) as in the case K = 1. The cost for this case is
the same as for k = 1, $O(N \log_\Sigma N)$ words of space.

Eventually, this will lead us to have the cost of constructing k-error tree for any k to be 
0{Nlog-sN ''°^^, " ) words. 

4.2.2 Query stage

When k = 1, the number of possible error positions is m, as m is the length of P. When
k ≥ 2, the number of possible combinations of error positions is $\binom{m}{k}$, which is
bounded by $O(\frac{m^k}{k!})$.
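
The bound follows from $\binom{m}{k} = \frac{m(m-1)\cdots(m-k+1)}{k!} \le \frac{m^k}{k!}$. A short numerical check, with hypothetical function names, is:

```python
from math import comb, factorial


def position_combinations(m: int, k: int) -> int:
    """Number of ways to choose k error positions in a pattern of length m."""
    return comb(m, k)


def upper_bound(m: int, k: int) -> float:
    """The bound m^k / k! quoted in the text."""
    return m ** k / factorial(k)
```

For m = 10 and k = 2 there are 45 combinations of error positions, against a bound of 50.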

Before stating the steps, we need to describe the following cases:

Case 1: Walking a suffix in the KST diverges at an internal node. The design of the algorithm
already covers this case, as all internal nodes of the KST have already been marked in the
error tree.

Case 2: Walking a suffix in the KST diverges in the middle of an edge. In this case, we allow
jumping (skipping the next symbol on the edge) up to k times while walking along that edge
(and/or any following edge). If the walk ends at a leaf after ≤ k jumps, the number of jumps
performed is deducted from the k mismatches allowed during the searching process. If the walk
ends at an internal node after ≤ k jumps, this case is similar to case 1, but the number of
jumps performed is again deducted from the k mismatches allowed during the searching process.
If after exactly k jumps we have not reached a node (internal or leaf), then P has no outputs
at all within k mismatches of any string in the database, as one of its suffixes could not
reach a leaf or an internal node after k jumps (where jumps represent mismatches). Note that
jumps are counted only on edges and not at internal nodes, as the design already marks the
internal nodes in the error tree, and the jumps (presumed errors) after these internal nodes
are already accounted for in the design.

As a result of these cases, we define the following function:

Definition 4: For a string s, an integer K, and a given KST, the function All Visited Nodes 
with k jumps denoted as AVNJ(s, k) returns a list of the nodes' keys and the edges' lengths 
with their order that resulted from walking s in the KST with allowing k jumps (in case of 
mismatch) on only the edge; and the positions of the jumps, if occurred. 
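
The edge-walking part of AVNJ can be sketched for a single edge. The sketch below compares a string against one edge label and records where jumps occur; it does not model node keys or the continuation onto later edges, and the function name and the use of 0-based positions are assumptions.

```python
def walk_edge_with_jumps(edge_label: str, s: str, k: int):
    """Walk s along one edge label, jumping over at most k mismatches.

    Returns the 0-based jump positions if the shorter of the two strings is
    consumed within the budget, or None if more than k jumps are needed.
    """
    jumps = []
    for pos, (a, b) in enumerate(zip(edge_label, s)):
        if a != b:
            if len(jumps) == k:   # the jump budget is already used up
                return None
            jumps.append(pos)
    return jumps
```

A return value of None corresponds to the situation in case 2 where the walk cannot reach a node within the allowed number of jumps.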

1. We collect AVNJ(s, k) for each suffix $s_1, s_2, \dots, s_m$ of the pattern. This gives m
lists; we call this collection R.

2. Walk the pattern P in the k-error tree. At each internal node v, assuming v is at level i,
search whether the $I_k$ table of v contains any of the combinations of keys that can be
extracted from the R list. If so, report the leaves' labels that were associated with the key
combination as the output. If the walk is on an edge, skip (jump) over any mismatches
encountered, in a simple convolution.

The overall bound for querying k mismatches is $O(\frac{m^k}{k!} + occ)$ time. Note that in
the average case computing AVN(.) for all suffixes traverses $O(\log_\Sigma n)$ nodes, which
gives a bound of $O(m + \frac{\log_\Sigma^k n}{k!} + occ)$.

4.2.3 Extension for indels 

In order to handle insertions and deletions for any k, we need to consider the following
modifications:

1. Guarantee that there is a leaf node k levels above every leaf in the KST. For this, we
visit the point k levels above each leaf and perform step 1 of section 4.1.3. Note that during
the construction of the 1-error tree through the (k-1)-error tree, we must already have created
leaf nodes for each of the cases 1 to k-1.

3. For insertions: at each internal node, we need to perform the steps in section 4.2.1
not for suff(s, i)[1 : m-i-1]; where i is the level of the node, instead of
suff(s, i)[1 : m-i-k-1], then add the results to the $I_{k.ins}$ table.

4. Likewise, we need to perform step 2 of section 4.2.1, but for suff(s, i)[1 : m-i-k].

5. Note that the edit distance can be any combination of substitutions, deletions, and
insertions. For this, we perform step 1.2 of section 4.2.1 for all the tables at a node, namely
the $I_k$ and $I_{k.ins}$ tables, not only $I_k$; the results are then added to a hash table
$I_{k.edit}$. This adds an extra factor of $2^k$ in words of space, so the total cost of
building a k-error tree that handles k-edit distance is $2^k$ times the cost of building the
k-error tree for Hamming distance.

6. The number of combinations that need to be searched for increases by a factor of $3^k$;
hence the query time for edit distance is $O(\frac{m^k}{k!} + 3^k occ)$
($O(m + \frac{\log_\Sigma^k n}{k!} + 3^k occ)$ in the average case).


5 Algorithm Design for Text Indexing 

The design and the construction of the error tree for problem 2 are similar to the design and the 
construction of problem 1, but there are some differences. Here, we describe these differences 
and what preprocessing steps will be needed to resolve them in order to be able to apply the 
same design of problem 1 . 

1. The depth of the paths in the suffix tree is not bounded by m. Paths with depth > m are
useless when we search for a pattern of length m; they also add cost to backward traversal of
the tree and to the creation of new nodes.

2. In problem 1, we already have leaf nodes for each suffix of the strings. In problem 2
this is not the case, because there is just one text string, unlike problem 1. The leaves for
the suffixes of each suffix (or, specifically, of each m-mer) are not explicitly constructed.

Now in order to resolve these issues; we will perform the following: 

1. All paths in the suffix tree must have depth ≤ m. For this, we traverse all paths in the
suffix tree and compute the depth of each path by summing the lengths of the edges along it.
When depth = m is reached: if that point is already a node, trim all edges and nodes below that
node and store explicitly the labels of the descendant leaves; if that point is on an edge,
create a leaf node at that point, copy and store explicitly the labels of the descendant leaves
of its sink node, and then trim the edge below that point. The cost of this step is $O(n)$ time
and space, since we do not need to read the edges' symbols but only the lengths of the edges
(constant time), and we create $O(n)$ new nodes. We call such a suffix tree a Trimmed Suffix
Tree of depth m, $TST_m$.
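
The effect of trimming can be illustrated without a suffix tree: each leaf of the trimmed tree corresponds to a distinct prefix of length at most m of some suffix, and it stores the positions of all suffixes sharing that prefix. The sketch below computes this grouping directly; it takes $O(nm)$ time rather than the linear time of the trimming step, and the function name and 0-based positions are assumptions.

```python
from collections import defaultdict


def trimmed_suffix_labels(text: str, m: int) -> dict[str, list[int]]:
    """Group suffix start positions (0-based) by their first m symbols."""
    leaves: dict[str, list[int]] = defaultdict(list)
    for start in range(len(text)):
        leaves[text[start:start + m]].append(start)
    return dict(leaves)
```

For the text "abab" and m = 2, the suffixes at positions 0 and 2 share the label "ab", which is the explicit label storage that step 1 calls for.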

2. Starting from $TST_m$, we need to mark/tag the suffixes of these suffixes, similarly to the
design of problem 1. Note that in problem 1 not all m suffixes of the strings were considered
in the design, since we only computed AVN(.) for the suffixes of the descendant leaves under
the internal nodes. Thus, $O(n \log_\Sigma n)$ suffixes are under consideration, not $O(nm)$.

To resolve this, note that after performing step 1, for instance, the 6th suffix of the
suffix at position 1020 is the prefix, from the root to position m - 6, of the suffix at
position 1026. Hence, to guarantee or create a leaf node for the 6th suffix of suffix 1020, we
start from the leaf of suffix 1026 and walk as far as position m - 6, then make sure there is a
leaf node there or create a new one. Again, there is no need to walk along the edges explicitly
to reach point m - 6, as reading the lengths of the edges is enough. Hence, the cost is
$O(\log_\Sigma n)$. In conclusion, guaranteeing or creating a leaf node for each considered
suffix at the internal nodes needs an extra cost of $O(\log_\Sigma n)$ time.
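
The observation about the suffix at position 1020 and the suffix at position 1026 is a general identity on m-mers, which the following sketch checks on plain strings. It assumes that "the j-th suffix" of an m-mer means the m-mer with its first j symbols dropped; positions are 0-based and the function names are hypothetical.

```python
def window(text: str, pos: int, m: int) -> str:
    """The m-mer starting at 0-based position pos (shorter near the text end)."""
    return text[pos:pos + m]


def suffix_of_window_as_prefix(text: str, pos: int, m: int, j: int) -> bool:
    """Check that dropping j symbols from the m-mer at pos gives the first
    m - j symbols of the m-mer at pos + j."""
    return window(text, pos, m)[j:] == window(text, pos + j, m)[:m - j]
```

Because the identity always holds, the leaf for a suffix of an m-mer can be located from the path of a later m-mer instead of being inserted from scratch.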

3. There is no need to build another compressed trie for the text in this problem, as $TST_m$
may be considered a sufficient representation for all the k-error trees. So all operations and
all the k-error trees can be constructed within $TST_m$ or using independent trees.

After making these modifications, we can build the k-error trees using the same steps as in
problem 1, and the cost is $O(n \frac{\log_\Sigma^k n}{k!})$ words.

There is an exception for the case k = 1, which needs $O(n \log_\Sigma^2 n)$ time but only
$O(n \log_\Sigma n)$ words of space; the reason is that we need to find AVN()[-1] for the
considered suffixes at each internal node in $O(\log_\Sigma n)$ time, but we only store the
AVN()[-1] value, which takes constant space.


6 Conclusion 

In this paper, we introduce a tree structure that solves non-trivial problems with very
efficient bounds. For the problems of K-Hamming distance and K-wildcards matching, we propose
a structure that needs $O(n \frac{\log_\Sigma^k n}{k!})$ words and takes
$O(\frac{m^k}{k!} + occ)$ query time ($O(m + \frac{\log_\Sigma^k n}{k!} + occ)$ in the average
case) for any online/offline pattern. For edit distance, we propose a tree structure of
$O(2^k n \frac{\log_\Sigma^k n}{k!})$ words with $O(\frac{m^k}{k!} + 3^k occ)$ query time
($O(m + \frac{\log_\Sigma^k n}{k!} + 3^k occ)$ in the average case) for any online/offline
pattern.